Report on project "Your GAN is Secretly an Energy-based Model"¶

Made by

  • Efimov Stanislav
  • Boris Miheev
  • Elfat Sabitov

Github: https://github.com/MarioAuditore/Statistics-energy-based-GAN/tree/main

Disclaimer: if a GIF doesn't play in the notebook, just visit the provided links (all animations are uploaded to GitHub).

Motivation¶

  • Despite the ability of GANs to generate high-quality images, GAN samples often contain artifacts. Many techniques have been applied to improve sample quality; however, all of them (DOT, MH-GAN, DRS) are either inefficient or lack theoretical guarantees. To build a good theoretical understanding with valuable practical applications, it is reasonable to change the perspective from which GANs are viewed.
  • A GAN can (implicitly) be viewed as an Energy-Based Model (EBM).
  • However, to apply this model to images, several difficult problems need to be solved.

Problem statement¶

  • The EBM interpretation enables efficient collaborative learning of the generator and discriminator; however, direct sampling from this model is extremely challenging for several reasons. For example, there is no tractable closed form for the implicit EBM in pixel space.
  • Therefore, the goal is to create a theoretically grounded and sustainable way of generating quality samples from a GAN.

Minimal theory of GANs¶

A GAN is a family of generative models defined through a minimax game between a generator $G$ and a discriminator $D$. $G$ takes a latent code $z$ from a prior distribution $p(z)$ and produces a sample $G(z)\in X$. The discriminator takes a sample $x \in X$ as input and aims to distinguish real data from fake samples produced by the generator. Here $p_d$ denotes the true data distribution and $p_g$ denotes the implicit distribution induced by the prior and $G$. The standard non-saturating training objective for the discriminator is:

$$ L_D = - \mathbb{E}_{x\sim p_{d}}[\log D(x)] - \mathbb{E}_{z\sim p(z)}[\log (1 - D(G(z)))] $$ And for the generator: $$ L_G = -\mathbb{E}_{z\sim p(z)}[\log D(G(z))] $$
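As a sanity check, both objectives can be evaluated numerically on a batch of discriminator outputs. Below is a minimal NumPy sketch; the logit values and the `sigmoid` helper are illustrative choices of ours, not from the project code:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Hypothetical discriminator logits on a batch of real and fake samples.
logits_real = np.array([2.0, 1.5, 3.0])    # D should push these toward 1
logits_fake = np.array([-1.0, 0.5, -2.0])  # ... and these toward 0

D_real, D_fake = sigmoid(logits_real), sigmoid(logits_fake)

# Discriminator loss: -E[log D(x)] - E[log(1 - D(G(z)))]
L_D = -np.mean(np.log(D_real)) - np.mean(np.log(1.0 - D_fake))
# Non-saturating generator loss: -E[log D(G(z))]
L_G = -np.mean(np.log(D_fake))
```

Both losses are positive whenever the discriminator outputs lie strictly between 0 and 1, and $L_G$ shrinks as the generator fools the discriminator (logits of fake samples grow).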

GANs as Energy based models¶

Energy-based model¶

The Boltzmann distribution $p(x) = \frac{\exp \left(-E(x)\right)}{Z}$, where $x\in X$, $X$ is the state space, $E(x) : X \rightarrow \mathbb{R}$ is the energy function, and $Z$ is the normalizing constant, defines an EBM.

GAN as EBM¶

  • A GAN rarely converges to the optimal generator $G$ during the adversarial game
  • Assume that $p_d$ and $p_g$ have the same support and that $D$ is near optimality. Writing $D(x) = \sigma(d(x))$, where $d(x)$ is the discriminator logit, $$D(x) \approx \frac{p_d(x)}{p_d(x) + p_g(x)} = \frac{1}{1 + \frac{p_g(x)}{p_d(x)}} = \frac{1}{1 + \exp \left (-d(x)\right)} \Rightarrow p_d(x) = p_g(x)\exp \left (d(x)\right) $$

$$ p_d^*(x) = \frac{p_g (x)\exp \left (d(x)\right)}{Z_0}$$

Therefore, if $D = D^*$ (the optimal discriminator), then $p_d^* = p_d$: the weighting and normalization correct the bias in the generator.
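This reweighting argument can be checked on a small discrete toy example. In the sketch below (the two distributions are our own toy choice, not from the paper), the optimal logit is $d(x) = \log(p_d(x)/p_g(x))$, and reweighting $p_g$ by $\exp(d)$ followed by normalization recovers $p_d$ exactly:

```python
import numpy as np

# Toy discrete example: the generator distribution p_g is biased relative to p_d.
p_d = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

# The optimal discriminator's logit on shared support is d(x) = log(p_d(x) / p_g(x)).
d = np.log(p_d / p_g)

# Reweighting p_g by exp(d) and normalizing recovers p_d.
w = p_g * np.exp(d)
p_corrected = w / w.sum()
print(np.allclose(p_corrected, p_d))  # True
```

With an imperfect (but reasonable) discriminator, the same reweighting reduces rather than eliminates the bias, which is the motivation for sampling from $p_d^*$ instead of $p_g$.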

Sampling methods¶

The original paper describes several ways of sampling. In our experiments we used Discriminator Langevin Sampling.

The energy functional is defined as $$E(z) = -\log p_0(z) - D(G(z)),$$ where $z$ is the latent noise, $D$ the discriminator, and $G$ the generator.

The sampling algorithm is defined as follows:

Algorithm: Discriminator Langevin Sampling
------------------------------------------
Input: N, ε > 0
Output: z_N ~ p_t(z)
------------------------------------------
Sample z_0 ~ p_0(z)

for i in range(N):
    n_i ~ Normal(0, I)
    z_{i+1} = z_i - ε/2 * ∇_z E(z_i) + √ε * n_i
end for
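The loop above can be sketched in NumPy. This is a minimal illustration on a toy energy with an analytic gradient (the function names and the toy $E(z) = \|z\|^2/2$ are our assumptions; the actual experiments differentiate $E(z) = -\log p_0(z) - D(G(z))$ through the networks with autograd):

```python
import numpy as np

def langevin_sample(grad_E, z0, n_steps, eps, rng):
    """Unadjusted Langevin dynamics:
    z_{i+1} = z_i - eps/2 * grad_E(z_i) + sqrt(eps) * n_i,  n_i ~ N(0, I).
    Runs a batch of chains at once (z0 has shape (n_chains, dim))."""
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        z = z - 0.5 * eps * grad_E(z) + np.sqrt(eps) * rng.standard_normal(z.shape)
    return z

# Toy energy E(z) = ||z||^2 / 2, i.e. p_0 standard normal and D(G(z)) constant,
# so the chains should forget the far-away start and settle near the origin.
rng = np.random.default_rng(0)
grad_E = lambda z: z                 # analytic gradient of the toy energy
z0 = np.full((200, 2), 5.0)          # 200 chains started far from the mode
samples = langevin_sample(grad_E, z0, n_steps=2000, eps=0.1, rng=rng)
```

The trade-off visible in our experiments shows up already here: a large `eps` mixes fast but adds discretization bias, while a tiny `eps` barely moves the chains within the given number of steps.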

The idea can be developed further, and other sampling techniques based on the energy functional can be applied.

Experiments¶

We conducted experiments on models and datasets from the paper. We also used one of the scores presented in the paper, the Fréchet Inception Distance (FID). FID passes both real and generated images through a pretrained Inception network, takes the last-layer feature representations, and compares the two sets of representations. The better the GAN, the closer these representations will be. In our case we used the FID implementation from the torcheval library (link).
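For intuition, the comparison underlying FID is the Fréchet distance between two Gaussians fitted to the feature sets. The sketch below uses diagonal covariances so the matrix square root becomes elementwise (the real metric, including torcheval's, uses full covariance matrices of Inception features; `fid_diagonal` is our simplified illustration, not the library API):

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2)).
    With full covariances the last term needs a matrix square root."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2)))

print(fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1]))  # identical Gaussians -> 0.0
```

The distance is zero only when the two Gaussians coincide, which is why lower FID indicates generated features statistically closer to real ones.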

CelebA¶

Experiment on github: https://github.com/MarioAuditore/Statistics-energy-based-GAN/blob/main/Celeba-GAN.ipynb

CelebA is a popular dataset of celebrity faces. Our idea was to find pre-trained generator and discriminator models and apply the sampling techniques to them. First of all, here is the default generation of faces from noise (the numbers above the images are discriminator scores):

image.png

For this batch FID = 301.8684

Then we apply Langevin Dynamics Sampling to the noise generated in latent space. We experiment with two hyperparameters:

  • eps stands for the step size
  • N stands for the number of sampling iterations

Below are results for different N and eps:

N = 20 | eps = 1e-1 | FID = 293.1780

image.png

N = 20 | eps = 1e-3 | FID = 296.0798

image.png

N = 20 | eps = 1e-5 | FID = 306.0943

image.png

N = 20 | eps = 1e-7 | FID = 304.4666

image.png

N = 50 | eps = 1e-7 | FID = 306.2776

image.png

As you can see, artifacts are still present, and the FID score is sometimes higher after sampling, which is a problem. To at least understand what the sampling does, we made an animation: https://github.com/MarioAuditore/Statistics-energy-based-GAN/blob/main/images/celeba_langevin_gen.gif

Now it is clear that with small step sizes the sampling algorithm slightly adjusts the colors of the image. Now we can look at the same animation, but with eps = 1e-1: https://github.com/MarioAuditore/Statistics-energy-based-GAN/blob/main/images/celeba_langevin_extreme.gif

As we can see, the sampling algorithm walks through various faces without settling on any of them, except for that hallucination artifact, which seems to occupy a large region of the latent space. Pretty interesting.

CIFAR10¶

Another dataset present in the original paper is CIFAR10. It contains images of objects belonging to one of ten classes. Here, once again, we found pre-trained generator and discriminator models and performed similar experiments.

N = 0 | eps = 0 | FID = 126.0720

image.png

N = 10 | eps = 1e-1 | FID = 138.7999

image.png

N = 10 | eps = 1e-3 | FID = 119.6181

image.png

N = 10 | eps = 1e-5 | FID = 123.5512

image.png

N = 10 | eps = 1e-7 | FID = 123.4056

image.png

N = 15 | eps = 1e-7 | FID = 106.6306

image.png

Once again, FID here is not stable, so one should (potentially) use their own classifier model for accurate scores instead of the default Inception network.

Also, we wanted to see how the image changes during sampling with a small step size: https://github.com/MarioAuditore/Statistics-energy-based-GAN/blob/main/images/cifar_langevin_gen.gif

And sampling with a bigger step size lets us observe the variety of images in the latent space: https://github.com/MarioAuditore/Statistics-energy-based-GAN/blob/main/images/cifar_langevin_extreme.gif

WGAN¶

Consider the problem of approximating a distribution $p_d(x)$, $x \in \mathbb{R}^d$. Let $p_0(z)$, $z \in \mathbb{R}^m$, $m \leq d$, be a latent prior, $\mathcal{G}$ be a class of generators $G: \mathbb{R}^m \rightarrow \mathbb{R}^d$, and $\mathcal{D}$ be a family of discriminators $D: \mathbb{R}^d \rightarrow \mathbb{R}$, where $\mathcal{D} \subset \mathrm{Lip}(1)$.

Then Wasserstein GAN (WGAN) aims at solving the minimax problem:

$$ \min\limits_{G \in \mathcal{G}} \max\limits_{D \in \mathcal{D}} \left[ \mathbb{E}_{X\sim p_d} \left( D(X) \right)-\mathbb{E}_{Z\sim p_0}\left( D(G(Z))\right) \right] $$

Screenshot from 2023-12-22 17-36-45.png

$D$ should be a 1-Lipschitz function: $|D(x)-D(y)|\leq \|x-y\|_2$. This is enforced with a gradient penalty: $(\|\nabla_{\hat{x}} D_{\omega}(\hat{x})\|_2 - 1)^2 \rightarrow \min\limits_{\omega}$

WGAN as an Energy-based model¶

Let $p_G(x)$ be the distribution of the fake images $G(Z)$, $Z \sim p_0$.

Recall that, given densities $p(x)$ and $q(x)$,

$$ KL\left(p(x)||q(x)\right)=\int_{\mathbb{R}^d}p(x)\log\frac{p(x)}{q(x)}dx $$

We aim to approximate the data distribution $p_d$ by the distribution

$$ p_t(x)=p_G(x)\frac{\exp(D(x))}{\int_{\mathbb{R}^d}p_G(x)\exp(D(x))dx}=\frac{p_G(x)\exp(D(x))}{Z_0}, $$

and at the same time to approximate $p_t(x)$ by $p_G(x)$ in the KL-metric.


Note that:

$$ \begin{cases} KL(p_d || p_t) \rightarrow \min\limits_{D \in \mathcal{D}}\\ KL(p_G||p_t) \rightarrow \min\limits_{G \in \mathcal{G}} \end{cases} \Leftrightarrow \begin{cases} \mathbb{E}_{Z\sim p_0}\left(D(G(Z)) \right) - \mathbb{E}_{X \sim p_d}\left(D(X) \right)\rightarrow \min\limits_{D \in \mathcal{D}}\\ -\mathbb{E}_{Z \sim p_0}\left( D(G(Z))\right)\rightarrow \min\limits_{G \in \mathcal{G}} \end{cases}, $$

which is equivalent to the WGAN training objective.

Hence, given $p_d \approx p_t$ and $p_G \approx p_t$, WGAN approximates the data distribution $p_d$ by the distribution $p_t(x)=p_G(x)\frac{\exp(D(x))}{Z_0} $

Its counterpart in the latent space of the generator is given by

$$ q(z)=p_0(z)\frac{\exp(D(G(z)))}{Z_0}, $$

which can be written as an energy-based model $q(z) \propto \exp(-E(z))$ with $E(z)=-\log p_0(z)-D(G(z))$

Rejection sampling¶

Sample from $q(z)=p_0(z)\frac{\exp(D(G(z)))}{Z_0}$, where $Z_0=\int p_0(y) \exp(D(G(y)))dy$

Take $p_0(z)$ as a proposal and define $M=\max\limits_{z}\frac{q(z)}{p_0(z)}$

Generate candidate $Z \sim p_0$ and accept it with probability:

$$ \alpha = \frac{\exp(D(G(Z)))}{MZ_0}=\frac{\exp(D(G(Z)))}{\max\limits_{y}\exp(D(G(y)))} $$
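This acceptance rule can be sketched directly. In the toy example below, `score(z)` is a stand-in for $D(G(z))$ (our assumption, since no real networks are loaded), and the maximum in the denominator is estimated from probe draws, because the true maximum over all $z$ is generally unavailable:

```python
import numpy as np

def rejection_sample(score, n_samples, rng, n_probe=10_000, batch=1000):
    """Rejection sampling from q(z) = p_0(z) exp(score(z)) / Z_0 with the
    standard normal p_0 as the proposal. Accept z with probability
    exp(score(z)) / max_y exp(score(y)), the max estimated from probes."""
    probe = rng.standard_normal(n_probe)
    log_m = score(probe).max()               # estimate of max_y D(G(y))
    accepted = []
    while len(accepted) < n_samples:
        z = rng.standard_normal(batch)
        alpha = np.exp(score(z) - log_m)     # acceptance probability
        accepted.extend(z[rng.random(batch) < alpha])
    return np.array(accepted[:n_samples])

# Toy score standing in for D(G(z)): the discriminator favors z near 1,
# so the accepted samples shift from N(0, 1) toward 1.
rng = np.random.default_rng(0)
score = lambda z: -(z - 1.0) ** 2
samples = rejection_sample(score, 5000, rng)
```

For this toy target the accepted samples follow a Gaussian with mean 2/3 and variance 1/3 (complete the square in $-z^2/2 - (z-1)^2$). Unlike Langevin sampling, rejection sampling is exact up to the max estimate, but its acceptance rate collapses in high dimensions, which is why the paper's image experiments rely on Langevin dynamics.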

Summing up¶

  • During this project we analyzed the original paper and the theory related to it.
  • We performed experiments on
    • CelebA
    • CIFAR10
  • We experimented with sampling hyperparameters
  • We made animations of how sampling changes the generated image for both small and large step sizes
  • We tried to reproduce the results for WGAN
  • We made an attempt to measure the quality of the GAN with the FID score